[57] ABSTRACT
In a multicomputer, concurrent computing system hav- 
ing a plurality of computing nodes, this is a method and 
apparatus for routing message packets between the 
nodes. The method comprises providing a routing cir- 
cuit at each node and interconnecting the routing cir- 
cuits to define communications paths interconnecting 
the nodes along which message packets can be routed; 
at each routing circuit, forming routes to other nodes as 
a sequence of direction changing and relative address 
indicators for each node between the starting node and 
each destination node; receiving a message packet to be 
transmitted to another node and an associated destina- 
tion node designator therefor; retrieving the route to 
the destination node from a memory map; adding the 
route to the destination node to the beginning of the 
message packet as part of a header; transmitting the 
message packet to the routing circuit of the next adja- 
cent node on the route to the destination node; and at 
each intermediate node, receiving the message packet; 
reading the header, directing the message packet to one 
of two outputs thereof as a function of routing direc- 
tions in the header, updating the header to reflect pas- 
sage through the routing circuit; and at the destination 
node, stripping remaining portions of the header from 
the message packet; storing the message packet; and, 
informing the node that the message packet has arrived. 
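
As an illustration of the method summarized above, the following minimal Python sketch models a route as one (direction, hop count) pair per dimension, x first and then y. The names build_route, build_memory_map, forward and deliver, the coordinate tuples, and the list-based header are assumptions made for this illustration only; they are not the flit-level header encoding of the invention.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]
Step = Tuple[str, int]          # (direction, hops remaining), e.g. ("+x", 2)

# Unit moves on a 2-D mesh for each output direction (illustrative names).
MOVES: Dict[str, Coord] = {"+x": (1, 0), "-x": (-1, 0), "+y": (0, 1), "-y": (0, -1)}


def build_route(src: Coord, dst: Coord) -> List[Step]:
    """Encode the route as relative offsets: x dimension first, then y."""
    route: List[Step] = []
    for axis, s, d in (("x", src[0], dst[0]), ("y", src[1], dst[1])):
        delta = d - s
        if delta:
            route.append((("+" if delta > 0 else "-") + axis, abs(delta)))
    return route


def build_memory_map(src: Coord, nodes: List[Coord]) -> Dict[Coord, List[Step]]:
    """Memory map held at the source node: destination -> precomputed route."""
    return {n: build_route(src, n) for n in nodes if n != src}


def forward(header: List[Step]) -> Tuple[str, List[Step]]:
    """One routing circuit: pick an output from the header, then update it."""
    if not header:                      # nothing left to travel: deliver locally
        return "local", header
    direction, hops = header[0]
    if hops == 1:                       # last hop in this dimension: strip it
        return direction, header[1:]
    return direction, [(direction, hops - 1)] + header[1:]


def deliver(src: Coord, dst: Coord, payload: list,
            memory_map: Dict[Coord, List[Step]]) -> Tuple[List[Coord], list]:
    """Prepend the stored route as a header and walk the packet to dst."""
    header = list(memory_map[dst])
    node, path = src, [src]
    while True:
        out, header = forward(header)
        if out == "local":
            return path, payload        # header fully consumed at destination
        dx, dy = MOVES[out]
        node = (node[0] + dx, node[1] + dy)
        path.append(node)
```

In this sketch the header shrinks as the packet advances, so an intermediate node only ever inspects the first entry before choosing an output.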

26 Claims, 6 Drawing Sheets 

INTER-COMPUTER MESSAGE ROUTING SYSTEM WITH EACH COMPUTER HAVING
SEPARATE ROUTING AUTOMATA FOR EACH DIMENSION OF THE NETWORK

ORIGIN OF THE INVENTION

The research described herein was sponsored in part by the Defense
Advanced Research Projects Agency, [...] the Office of Naval Research
under Contract No. N00014-79-C-0397 and in part by grants from Ametek
Computer Research Division and from Intel Scientific Computers wherein
said entities make no claim to title. It was also done in partial
fulfillment of the requirements for the degree of Master of Science of
applicant Charles M. Flaig at the California Institute of Technology
(Caltech), to whom this application is assigned.

CITED REFERENCES

Dally 86

Dally, William J., "A VLSI Architecture for Concurrent Data
Structures," Caltech Computer Science Technical Report 5209:TR:86.

Dally & Seitz 86

Dally, William and Charles L. Seitz, "The Torus Routing Chip,"
Distributed Computing, Vol. 1, No. 4, pp 187-196, Springer-Verlag,
October 1986.

Mead & Conway 80

Mead, Carver A. and Lynn A. Conway, Introduction to VLSI Systems,
Chapter 7, Addison-Wesley, 1980.

Seitz 84

Seitz, Charles L., "Concurrent VLSI Architectures," IEEE TC, Vol. C-33,
No. 12, pp 1247-1265, December 1984.

Seitz 85

Seitz, Charles L., "The Cosmic Cube," Communications of the ACM,
Vol. 28, No. 1, pp 22-23, January 1985.

BACKGROUND OF THE INVENTION

This invention relates to intercomputer message passing systems and
apparatus and, more particularly, in an intercomputer routing system
wherein message packets are routed along communications paths from one
computer to another, to the improvement comprising, a routing automaton
disposed at each computer and having an input for receiving a message
packet including routing directions as a header thereto and a plurality
of outputs for selectively outputting the message packet as a function
of the routing directions in the header, and, routing logic means
disposed [...] automaton for reading the header, for directing the
message packet to one of the outputs [...] header to reflect the passage
of the message packet [...].

[...] depicted in FIG. 1, it is practical [...] between the nodes 10
thereof. A full interconnection of channels quickly becomes impractical
as the number of nodes increases, since each node of an N node machine
must have N-1 connections. A configuration used for larger
message-passing multicomputers such as the Caltech Cosmic Cube
[Seitz 85] and its commercial descendants is that of a binary n-cube (or
hypercube) as depicted in FIG. 2 which is used to connect $N = 2^n$
nodes 10. Each node 10 has $n = \log_2 N$ connections, and a message
never has to travel through more than n channels to reach its
destination.

Although the choice of the binary n-cube for the first generation of
multicomputers is easily justified, the analyses presented in a 1986
Caltech PhD thesis by William J. Dally [Dally 86] showed that the use of
lower dimension versions of a k-ary n-cube [Seitz 84] connecting
$N = k^n$ nodes, e.g. an n=2 (2-D) torus or mesh, is optimal for
minimizing message latency under the assumptions of (1) constant wire
bi-section and (2) "wormhole" routing [Seitz 84b].

These 2-D (or optionally 3-D) networks also have the advantage that each
node has a fixed number of connections to its immediate neighbors, and,
if the nodes are arrayed in two or three dimensions, the projection of
the connection plan into the packaging medium has all short wires. Also,
the number of nodes in such a machine can be increased at any time with
a minimum amount of rewiring. The low dimension k-ary n-cube greatly
decreases the number of channels, so that with a fixed amount of wire
across the bisection, one may use wider channels of proportionally
higher bandwidth. This higher bandwidth, particularly with wormhole
routing, can more than compensate for the longer average path a message
packet must travel to reach its destination.

The time required for a packet to reach its destination in a synchronous
router is given by $T_n = T_c(pD + \lceil L/W \rceil)$; where $T_c$ is
the cycle time, p is the number of pipeline stages in each router, D is
the number of channels that a packet must traverse to reach its
destination, L is the length of the packet, and W is the width of a flow
control unit (referred to hereinafter as a "flit").

As an example, let us assume that there are N=256 nodes, 512 wires
crossing the bisection for communication (ignoring overhead from
synchronization wires), a message length of 20 bytes (i.e. 160 bits),
and an internal 2-stage pipeline. The bisection of a binary hypercube
has 128 channels in each direction, each with a width of 2 bits, and an
average of $(\log_2 N)/2 = 4$ nodes must be traversed, so that
$T_n = (2 \times 4 + 160/2)T_c = 88T_c$. By comparison, the bisection of
a 2-D (k x k) mesh, where k=16, has 16 channels in each direction, each
with a width of 16 bits, and an average of $(2k/3) \approx 11$ nodes
must be traversed, so that $T_n = (2 \times 11 + 160/16)T_c = 32T_c$.
Thus, the binary hypercube [...] bidirectional mesh network with the
[...].

[...] Torus Routing Chip (TRC) designed at Caltech [Dally & Seitz 86]
[...] unidirectional channels [...] entitled Torus Routing Chip by
Charles L. Seitz and [...] and assigned to the common assignee of this
application, the teachings of which are incorporated herein by
reference. As depicted in FIG. 3, the torus is [...] folded in its
projection onto a common plane [...] all channels the same length.
Deadlock (a major consideration in multicomputers) was avoided by using
the
thus .voiding cyclic d^d«««j^ttep^tyof ^^^^ S^Sm^ls to announce the

deadlock. Hie TRC was self-timed to avoid the prob- bet ^.~**,?*Xir wuh 720MHz system dock 

Seated -th<^g.^^aUrge , sva££» ^Itc^et 'cM^S^ogy? S 

network. There ^ * ^ n ^T?^c^a- SH* was expected to be .bout 2MB/s on each 

"p^foutpu^^ S^rcl^S data area (on the chip) only to achieve fairly dismal perfor- 

lines, the TRC achieved a data rate of20MB/s. Each -anc^ invenUon to 

Stetim^* "ff-5*Afc mance witha niinimum amount of m1.cc area on the 

d^r^e^ft^Twor^SgMf -ting message p~ket. in . message-p-sing. mul- 

the desired output channel U unavailable, the message is tic«nputo sytoL inven.ion to 

circuitry for communication with other chips. Each SUMMARY OF THE INVENTION 

S^each cWd^ Vo^im^hTnumber of con- transferring the mes»ge packets to and from the RAM. 

pun on a ^J^ t ^°^°~r^^~T ^ OTlaail£ ,j, c jo by the improved method of routing the message packets 

^^iJ^smSy^S^ofT^utog bLeen the node, comprising the steps of, providing a 

""mT^^««[^ tta 6^ BitAi in the routing automaton at each node and interconnecting the 

^mar^ c^be cOTUmrf m the ^^V* ™ routini automata to define communications paths inter- 

TRC the J^^Ciogto JJJJ "Sing the nodes along which the message packets 

rS'o^S^P^co^^rSSt » c^Tuted; and at each routing automaton, receiving 

ZJL tuhto rdativc K afttaiei of the desti- a message packet including routing ■brectwns compm- 

^^^«T^S^nn^ c^»tofflts consisting ing the header at an input thereof, reading the header, 

nation and an ^^*^Jlf"3; ifirectrng the message packet to one of two outputs 

of a 1M* data wad and ♦ff^^.J*™^ SSLiftir^onof Se routing directions contained 

^ b^tetp^u^g ao^S»text p^ces- 65 3* •^SSStS^S 

K^h^^U^^^ S P-ke. -A transmitting the message packet to 
the routing automaton of the next adjacent node on the route to the
destination node. Additionally, in the preferred [...] node: stripping
remaining portions of the header from the message packet; storing the
message packet in the RAM; and, informing the CPU that the message
packet is in the RAM.

The preferred method at each packet interface also comprises the
additional step of, storing a memory map of locations of the other nodes
in the system and a corresponding route to each node; wherein the step
of adding routing directions as a header to the beginning of the message
packet comprises, receiving a destination node designator from the CPU
requesting the transmission of the message packet; retrieving the route
to the destination node from the memory map; and, adding the route to
the destination node to the beginning of the message packet as part of a
header.

DESCRIPTION OF THE DRAWINGS:

FIG. 1 is a simplified drawing depicting a prior art computer system
wherein each node is connected directly to every other node.

FIG. 2 is a simplified drawing depicting the node interconnection scheme
employed in a so-called hypercube according to the prior art.

FIG. 3 is a simplified drawing depicting a prior art torus routing chip
interconnection scheme.

FIG. 4 is a simplified drawing of a single, one dimensional routing
automaton according to the present invention.

FIG. 5 is a simplified drawing showing three routing automata of FIG. 4
connected in series to control packet movement in three dimensions.

FIG. 6 is a simplified functional block diagram of a fine-grain computer
chip according to the present invention.

FIG. 7 is a drawing of an exemplary data stream as it passes through a
series of nodes according to the method of packet routing of the present
invention.

FIG. 8 is a simplified drawing to be employed with FIG. 7 to follow the
example of FIG. 7.

FIG. 9 is a simplified block diagram of the internal structure of a
routing automaton according to the present invention in one possible
embodiment thereof.

FIG. 10 is a simplified block diagram of the internal structure of a
routing automaton according to the present invention in a preferred and
tested embodiment thereof.

FIG. 11 is a simplified block diagram of the internal structure of a
routing automaton according to the present invention in an embodiment
thereof intended for use in a torus routing chip system of the type
shown in FIG. 3.

FIG. 12 is a simplified block diagram of the internal structure of a
routing automaton according to the present invention in an embodiment
thereof intended for use in a hypercube as shown in FIG. 2.

FIG. 13 is a drawing corresponding to the embodiment of FIG. 10 [...]
thereof in their connected sequence as incorporated into a chip laid
out, built and tested by the applicant herein.

FIG. 14 is a functional block diagram depicting a stage of a synchronous
pipeline system as is known in the art.

FIG. 15 is a functional block diagram depicting a stage of an
asynchronous pipelined approach to the present invention.

FIG. 16 is a partial block diagram of a FIFO stage as employed in the
asynchronous approach to the present invention.

FIG. 17 is a functional block diagram of the asynchronous routing
automaton of the present invention in its built and tested embodiment
employing binary switching logic.

DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention is based on two major deviations from the prior
art. The first was the rejection of employing the x and y addresses of
the destination of a packet in the header in favor of a prefix encoding
scheme which would allow the packet header to encode the relative
address of the destination in the form of a path "map" [...] on several
small successive [...]. This approach was the key to [...] having to see
a large amount of header information before any logic could decide where
to send the head of the packet. This change also simplified the required
routing logic circuitry enough to allow it to handle forwarding
automatically on a local basis without having to disturb the CPU. The
novel deadlock-free routing method decided upon (a design criteria
"must") was to send packets on a fixed route on a mesh of first x, and
then y, instead of employing virtual channels to avoid deadlock. Initial
designs involved 5-bit flits (4 data, 1 control, with an acknowledge
wire in the reverse direction) to be sent in parallel on each channel,
and to be internally switched using a crossbar switch.

The second major deviation from the prior art was the early rejection of
the crossbar switch in favor of a new formulation which has been
designated as "routing automata". It was realized early on that a
crossbar switch, while very general, had the disadvantage of taking up
space (on the chip) proportional to $(nW)^2$, where n is the number of
inputs and outputs, and W is the number of bits being switched. With a
fixed routing scheme being employed, the generality of a crossbar switch
was not needed and it was highly desirable to devise a scheme in which
the area consumed would only increase linearly with nW, or as close as
possible thereto. This scaling would make it easier to modify the router
for more dimensions or wider flits without involving major layout
changes, would decrease the path length in the switch and hence increase
its speed, and would, hopefully, decrease the overall area for designs
with wide flits and a large number of dimensions.

As depicted in FIG. 4, each of the automata 12 is responsible for
switching the packet streams for one dimension of the overall router. An
automaton's input, generally indicated as 14, consists of streams from
the + and - directions as well as from the previous dimension. Its
output, generally indicated as 16, consists of streams to the + and -
directions as well as to the next dimension. For n dimensions, n of the
automata 12 are strung together in series as depicted in FIG. 5; and, if
properly constructed, their size, i.e. the area consumed on the chip,
increases roughly linearly with increased width of the flits, for a net
increase in area proportional to nW, as desired. Both synchronous and
asynchronous (i.e. self-timed) versions of the automata of the present
invention will be described shortly. The self-timed version is intended
to have each of its components highly modular so that they could be used
not only to implement mesh routing, as in the Mosaic C chip, but could
also be fit together to implement unidi-
rectional routers, routers Jhypcxcube,. or nuny other ^^J^J^^^Si^
savctu ^ .united only by the f^o?fl?E r^XCSSXl^lU 
The basic components include a FIFO for we oe- ^^J NATI ^v {aThe tail (T) which makes 
tween stages and buffering a ^^.^^^^ , S^p^ ck»« aHf U« channel con- 

.nd merge d»U *~ » 1^^,^^?^ 5 n^Ta. ^jSLT^ghTe nodes 10, and is Finally 
relative address of the destination as the packet passes «^°"J ^ rt^FSTTNATION node 10 along with 
through, and control structures for aU o th^oregouig S^S^vc data flits as 

Before continuing with specifics of the automata, how- ^.^"^i^fT . te appreciated from this 
ever, the novel prefix encoding ^J^T"* .„ Z^oL^J^ofto^nl invention 

invention as employed there* wdl be addressed first u, 10 nits to represent ^ offsets 

^^•JSSjw^^S- » SSrf £. ™Sg dUloy » .mom... IM 

by the letter M .art Jme ™ ™ to 7 ^ p ,^ nd M ^ gtreams (1) to the + direction. (2) to the - direction 

picket as it leaves *e node Usted to ue «^»° a ^ and (3) to the next dimension at the outputs 16. 

depicted in FIG. S. with "SOTOCE ^g^tmg automala u which tave three input, and 

node 10 sendmg the . packe t-d Areeoltpu^and which are composed to make 2-D or 

indicating the node 10 mtended to reoewe ^^natt, ^ ftemselves be composed of simpler 

Tune » from the top ^» ™" l£^^ 0 X a^uHnteUnut. the routing automata most in- 

that as header flits areno longer needed as part of the ™^T^ a dccisiotl dement with one input 

«mp" they are «r«W*dcrff. , * Sream and two output streams and a merge element 

In the example. £• 223^iwfir> wUhtwo input streams and one output stream. Other 

r"???' * i. £!|2«Ui Dtnins through take their input streams and arbitrate which of them to 

it^lX***^^^ -d merge into *e strearn exiting in the + 

additional nodes to proceed in the «^ d ^™> „ ^ ^ structU re of the automata 

^^^'SSXi^^X^^ in ^ctErSer simpUfy the dogn and layout in 
S^I^u^^S^ho^r, the next flit the same manner a. breaking up the router mto a senes 
of decrementing 10 w. in „. lh , D f i.n automata. Even when parts cannot be directly 

b the l«*dingzero desyma»r • ""^K Se reuse! uTTafte. be saved^employing modific 
^i^t i^Stt J tn^S ft2n « STSuher than a complete redesign. In the extreme 
+ direction in tfte v ^ i-hi— -a— —ch of the decision and merge opertuons can be 

^St^tr^^J^SpP^ cTve^eo ^to bin^fonn, where^elements are 
Se^eSr ^aamf p^T^«^ the replaced by cascaded binary elements. In tlus case, the 
elements become very homogeneous and the automata
can be formed out of a minimal subset of very simple
elements. This approach is the one used in the self-timed
routing automata to be discussed in detail later herein.
These automata can also be constructed for different 
channel configurations using the same set of internal 
elements. For example, they can be constructed for 
unidirectional channels as used in a torus or a hypercube.
Examples of the internal structure of such automata
are shown in FIGS. 11 and 12 where FIG. 11 depicts
a unidirectional automaton 13' for use in a torus
routing scheme and FIG. 12 depicts a hypercube routing
automaton 12".
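
The conversion of decision and merge operations to cascaded binary elements can be sketched as follows. This is a minimal Python illustration, not the circuit itself: the function names, the use of a signed integer offset, the rule that a zero offset hands the packet to the next dimension, and the fixed-priority merge are all assumptions made for the example.

```python
from typing import Optional


def binary_decision(take_second: bool) -> int:
    """Binary decision element: one input stream, routed to output 0 or 1."""
    return 1 if take_second else 0


def three_way_decision(offset: int) -> str:
    """Three-way choice (+, -, next dimension) from two cascaded binary elements."""
    # Stage 1 (assumed rule): an exhausted offset goes to the next dimension.
    if binary_decision(offset == 0) == 1:
        return "next"
    # Stage 2: otherwise the sign of the offset selects the + or - channel.
    return "-" if binary_decision(offset < 0) == 1 else "+"


def binary_merge(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Binary merge element: two input streams, one output stream.

    Giving priority to `a` is an arbitrary arbitration choice for this sketch.
    """
    return a if a is not None else b


def three_way_merge(a: Optional[str], b: Optional[str],
                    c: Optional[str]) -> Optional[str]:
    """Three-input merge built by cascading two binary merge elements."""
    return binary_merge(binary_merge(a, b), c)
```

Because every element has the same two-way shape, the same small set of parts can be rearranged for other channel configurations, which is the homogeneity described above.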

The invention of the routing automata and its associated method of
operation in routing packets provided the opportunity of constructing a
novel chip for the Mosaic C application as well; that is, the present
invention includes a novel chip for containing each node in a
fine-grain, message-passing, multicomputer, concurrent computing system
wherein a plurality of computing nodes are each contained on a single
chip. This "Mosaic chip" 22 is shown in functional block diagram form in
FIG. 6. The Mosaic chip 22 is a complete node for a fine-grain
concurrent computer. It contains all of the necessary elements including
a 16-bit CPU 24, several KBytes of ROM 26 and dRAM 28, as well as
routing circuitry comprising a packet interface (PI) 30 and router 32
for communicating with neighboring nodes in a mesh. All of these
elements are tied to a common bus 36. Since the router had to fit on a
chip along with a processor and memory, the design had to be simple and
compact; the automata-based routing scheme of the present invention as
described hereinbefore provided such a capability. The PI 30 takes care
of encoding the packet header and transferring packet data to and from
memory. A simple cycle-stealing form of Direct Memory Access (DMA) is
preferred to keep up with the high data rates supported by the router.
In the preferred embodiment, the PI 30 also contains some memory mapped
locations that are used to specify the relative x and y addresses of the
destination node, and an interrupt control register.

The PI 30 generates the appropriate direction control flits and
multiplexes the relative x and y addresses in its registers into the
flit width required by the router 32. It does the same multiplexing for
the packet's data words, which come from an output queue in the dRAM 28.
A tail flit is added when the output queue becomes empty. These flits
are injected into the previous dimension input of the automaton for the
first dimension. Any packets coming out of the last dimension simply
have their tail stripped off and the flits are demultiplexed into a
16-bit word, which is then stored in an input queue in the dRAM 28. The
CPU 24 may be interrupted either when the output queue becomes empty,
when a tail is received, or when the input queue becomes full.

As will be described in greater detail hereinafter, the Mosaic router 34
communicates with other nodes using a 3-bit wide flit (2 data,
1 control) with an acknowledge wire in the reverse direction to control
movement between stages of the preferred pipelined design and prevent
overwriting from one stage to the next. Any time an acknowledge is
present, flits are allowed to progress through the pipeline. The flit
can also be made wider to include more data bits. As presently
contemplated, the first "production" Mosaic chip will employ a 5-bit
flit. As a result of the bit-slice design employed in a tested
embodiment it became more efficient in the datapath to combine the
decision and merge operations in a slightly different manner than in the
sample automaton 12 shown in FIG. 9. The actual preferred configuration
is shown in FIG. 10. In this tested configuration, generally designated
as 12'", the decision and merge operations were lumped together into one
switch matrix, which handles both the multiplexing and demultiplexing of
the data streams, with 4 minimally encoded control wires and their
complements selecting one of the possible switch configurations.
Although the combined switch approach may possibly minimize the overall
size of the router by homogenizing the data path layout for the
different channels, it also complicated the control circuitry. Thus,
while it may have been the best approach for a 1-bit slice path, those
implementing the present invention for wide data path slices may find
that it is not a good approach for such applications.

In an attempt to minimize the overhead of extending the width of a flit,
the Mosaic router as built and tested employed an approach of
constructing the data path out of 1-bit wide slices, with the +, -, and
N paths for each bit being placed immediately next to one another. This
preferred approach allows the same switching elements to be used no
matter how wide the flit is. On the negative side, it also means that
the control signals for all three data paths had to be propagated
through all of the elements, and this led to a somewhat larger overhead
in wiring than is necessary. It also complicated the layout of the
control circuitry since it had to fit in an effectively smaller pitch.
The data path is made up of a number of elements as shown in FIG. 13
(which corresponds to the embodiment of FIG. 10 as actually laid out and
implemented on a chip). Each element, of course, comprises three
portions (for the +, - and N paths) as indicated by the dashed lines
dividing them. When connected sequentially in the order shown, these
elements form a complete data path for a Mosaic routing automaton. The
inputs are to an input latch 38 followed by an input shift register 40
as required to properly interface with the preceding stage. This is
followed by the zero and tail detection logic 42. Next follows the
decrementer and leading zero generator logic 44. The stream switching
element 46 follows next. As indicated by the dashed arrows, the stream
switching element portions operate in the same manner as the three
decision elements 18 of FIG. 10; that is, the two outer switching
element portions switch between straight ahead and towards the center
path while the center switching element portion can switch between
straight ahead and either of the two outer paths. The properly switched
packet paths from the stream switching element 46 then proceed to an
output latch 48 and an output buffer/disabler 50 as required to properly
interface the asynchronous automata to the next node without destructive
interference, as described earlier herein.

Turning briefly to the packet interface (PI) 32 mentioned earlier
herein, using the minimum sized flit data width of 2 bits and the Mosaic
word size of 16 bits, the inventors herein found that the router of the
present invention can deliver one word every 8 cycles. The data rate
becomes even faster if wider flits are used. There is no way that the
CPU 24 can keep up with this data rate under software control.
Therefore, it was decided that in the preferred approach a simple form
of cycle-stealing DMA (as is known in the art) should be used to
transfer packets between the router 34 and memory, i.e. dRAM 28. For
this purpose, four extra registers were added to the CPU 24 to be used
as address pointers and
limit registers for the input and output channels. Each time the storage
bus is not being used by the CPU 24 (about once every 3 or 4 cycles for
typical code) the microcode PLA emits a bus release signal. A simple
finite state machine then arbitrates between bus requests from a refresh
counter, the input channel, and the output channel and grants the bus
cycle to one of them. If a channel is given the cycle, it pulls on a
line which causes the corresponding address pointer in the CPU 24 to be
placed on the address bus, and the channel then reads or writes data
from that location in the dRAM 28. The address pointer is then
incremented and compared with its limit register. If the two are equal,
the DMA logic is disabled and the CPU 24 is interrupted to process the
I/O queues. If an interrupt occurs when an output packet word is
requested, a tail is sent following that packet.

A packet is sent by setting the output address pointer to the starting
location of the desired data in the dRAM 28 and setting the limit
register to the location of the end of that data. The CPU 24 then writes
the relative address of the destination node of the packet to memory
mapped locations in the channel (using the sign and magnitude form
described above). When the last location is written, the header is
encoded and sent, with the data following. Data is best received by
setting the input channel pointer to the starting location of a queue
and setting the limit register to the end of the queue. The CPU 24 can
then examine the value of the address pointer at any time to see how
many words are in the queue. Currently, there is no provision for
marking a tail, so if explicit knowledge of the length of a data packet
is required, one of two methods must be used: (1) the length of the
packet can be encoded in the first word of the packet, which the CPU 24
can then examine, or (2) the CPU 24 can be interrupted when the tail of
a packet arrives so that the interrupt routine can examine the input
address pointer register to determine the length of the packet.

Much of what was learned from the design of the Mosaic synchronous
router can be applied to an asynchronous routing automaton. An
asynchronous router can be used in physically larger systems, such as
second-generation multicomputers, in which the interconnections are not
limited to being very short wires. The mesh routing chip (MRC) now to be
described is designed to meet the specifications for these
second-generation multicomputers. These routing automata are intended to
be a separate chip, similar to the Torus Routing Chip mentioned earlier
herein, as opposed to being part of an integrated "total node" chip such
as the Mosaic chip described above. As in the Mosaic router, the 2-D MRC
has 5 bidirectional channels, with channels in the +x, -x, +y, -y
directions and a channel connecting it to the packet interface. Data is
represented on each of the channels using 9-bit wide flits (1 tail,
8 data), where the first bit is the tail bit.

For a 2-D router, the first two flits form the header. In each header
flit, a relative offset of six bits [...] up to sixty four nodes along a
single dimension, which should be sufficient for any second generation
machine with large nodes. The seventh data bit in a header flit is
reserved for the future addition of broadcast support and the eighth bit
is the sign. A 9-bit flit together with the asynchronous request and
acknowledge signals for each channel requires a total of eleven pins.
Five bidirectional channels (+x, -x, +y, -y, and the node) then require
110 pins, and the constructed version of the chip was placed in a
132 pin PGA package. The remaining pins are used for a reset and for
multiple Vdd and GND pins to minimize noise.

Pipelining is well known in the art and used in many synchronous systems
to increase their throughput [Seitz 84]. Each cycle, each stage of the
pipeline accepts data from the previous stage, performs some relatively
simple operation on the data, and passes the resulting data on to the
next stage. Data is passed between stages during each cycle by clocked
registers. A typical synchronous pipeline section is depicted in
FIG. 14. The combinational logic in each stage has one clock period in
which to produce valid output data based on its input data. This time,
$T_c$, is the same for all stages in the pipeline, and the time required
for data to flow through the pipeline is $T_p = T_c\,p$, where p is the
number of stages in the pipeline.

A similar arrangement can be used in an asynchronous system. Instead of
a global clock, the 4-cycle request and acknowledge signals
[Mead & Conway 80] are used to control data flow between stages, as
shown in FIG. 15. In an asynchronous pipeline, each stage processes data
at its own rate, and passes its output data to the next stage when it is
finished. Each stage, therefore, has its own cycle time, $t_c$, which is
the time it requires to complete its request and acknowledge 4-cycle.
Each stage also has a characteristic fallthrough time, $t_f$, which is
the time required from when an input request is received until the data
is processed and an output request is generated. The ratio $t_c/t_f$
determines across how many stages a cycle (and an item of data being
worked on) extends. By necessity, $t_f < t_c$ and for most practical
designs $t_c \approx$ [...] $t_f$.

Looking back at the three dimensional automata series shown in FIG. 5,
it can be seen that data flows in only one direction within each
automaton 12, and the different paths are independent (except for merge
operations). Thus, it is an easy transition to think of routing automata
as being implemented using a pipeline structure, as in the MRC. As
established earlier herein, the transit time, from source to
destination, of an unblocked packet in the synchronous case is given by
$T_n = T_c(pD + \lceil L/W \rceil)$. For the asynchronous case, some of
these terms are changed because the head of a packet advances at the
fallthrough rate, which is less than the cycle time. The formula for
network latency of the asynchronous case can be expressed as
$T_n = T_f D + T_c \lceil L/W \rceil$; where $T_f$ is the fallthrough
time for a node. For relatively short packets, D is comparable to L/W,
so there is no strong motivation to reduce either $T_f$ or $T_c$ at the
expense of the other. $T_c$ and $T_f$ can be expressed as
$T_c \approx 2t_p + t_c$ and $T_f \approx t_p + t_f\,p$; where $t_p$ is
the time required to drive the pads, p is the number of stages in the
internal pipeline, and $t_f$ and $t_c$ are the average fallthrough and
cycle times, respectively, for a single stage of the pipeline, as
described above. A pad, and the external components connected to it, are
relatively difficult for a VLSI chip to drive, so $t_p \gg t_c > t_f$.
This means that the number of stages in an asynchronous pipeline can be
increased without significantly increasing the overall delay. In the
case of the MRC, increased pipelining has a significant advantage in
that having more pipeline stages provides the network with more internal
storage for packets, and consequently, helps prevent congestion of the
network.

The preferred asynchronous FIFO structure employed in the present
invention is based on chained Muller C-elements [Mead & Conway 80]. Its
basic
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structure is shown in FIG. 16. Care must be taken to decnmenten &n\aa than the cycle time of a C-ele- 

insurc thai the rcsmter cells 54 controlled by the C-ele- ment and Schmitt trigger combination. 

menS on or off before request and Finally, it should be noted that. mternaUy. the -utom- 

aX^ied7c .^>b « generated and that they are ata of the present invention use 4-cycle *^*J°[ 
flT enough to laich the daU before the load line 5 flow control; but. to increase speed and conserve 

SnS^Se^ain. The first requirement can be taken power, signals sent off chip must use a ^eyefc : conven- 

cmui^ra "*"r^ . M „tt:^Z!* Am\«u m thf n-miMt don. A small amount of conversion must be done, there- 

care of by ^*^8 ' 5 SfuZ^ fc£ beforTdriving the pad,. This conversion also adds 

and acknowledge toe* or '^SS'Jl LZfas a Sail amount of delaVfo the request/acknowledge 

Schmitt trigger (a gate with °° «*J - ^ which be lps ensure that the data is valid by the 

is known m the art) to detect the state of the load con- l*^ „_ tf ^ dcUys i„ the lines 

trol line. sliahtlv skewed. If the delays are skewed by a large 

Initially all of the Cements 56 are reset to a Dam ^^f^funlped RC dday can be added to *e 

is presented on the inputs, and the request line (Ro) is ___. ii-. eternal to the chin 

puSed high. This causes the output of the first Oele- the first Asynchro- 

ment 56 to be pufled high, ^usmgthe date to be ^ werc Xiitted in September 1987 

latched When this load control line becomes high, the ™ fobrication in a 3 micrometer CMOS 

data is assumed to be latched, a request is passed to ^ fabdcated ^ purged chips were re- 

the next stage, and the acknowledge line (Ac) to the ^ Novcmber mr Th cy functioned correctly at 

previous stage is pulled high. When the next stage ^ a 8peed of approximately 10 Mflits/s. Production chips 

latches the data and an acknowledge (Aj) * recwed fabricatcd 1^ in a 1.6 micrometer CMOS process 

from it and the request line (Ro) goes low, the FIFO ^pcrM^ at about 30 Mflits/s, about three times faster, 

state is reset to its initial condition. In this manner, the subsequent design refinements in layout and implemen- 

data quickly falls through the chain of FIFOs, with the ^ve increased the potential speed of the MRC to 

data always spread across at least two stages. If the 2J 43 Mflits/s in 3 micrometer CMOS. This speed obtained 

request time is significanUy less than the acknowledg- by a prototype of FIFO and 2/4 cycle conversion cir- 

ment time, then the flit will be spread across more than cuitry ^^mitted for fabrication in February 1988 and 

two stages while it is falling through the pipeline. returned and tested in April 1988. In a 1.6 micrometer 

The second requirement of fast latches is usually easy CMOS process, these designs are anticipated to operate 

to meet, since latches are generally much faster than w ^ approximately 100 Mflits/s. The primary limitation 

C-elements (and Schmitt triggers). The registers 54 used at gpeeds is the lead inductance of PGA packages, 

in these FIFOs consist of a closed loop of a strong and improved packaging techniques should allow such 

weak inverter 58, with the data gated to the input node cmpg to operate at 150 Mflits/s. 

of the stronger inverter 58 as is known in the art. Thus, Wherefore, having thus described the present inven- 

the data is inverted at each stage of the FIFO, and these 35 wna t is claimed is: 

stages should be used in multiples of two to preserve the t An inter-computer message routing system 
sense of the data. In order to save space in the tested wherein message packets are routed among a plurality 
embodiment, the register cells 54 are flip-composed of computers along communication paths between said 
vertically, for data path widths that are multiples of 2. computers in an n-dimensional network of said commu- 
In the MRC the path width is 10 bits, so' there is an ^ nication paths, different groups of said communication 
extra bit available for propagating information between paths comprising different ones of the n dimensions of 
stages of the pipeline, if it becomes desirable to do so. ^ network, said message packets each comprising a 
For 1.2 micrometer CMOS, it is expected, from model header containing successive routing directions relative 
calculations, that each FIFO should have a t/(lc fall- to successive computers along a selected route in said 
through tune) of about 1 ns. This follows the assump- 45 network, said system comprising: 
tion of t/<tp (pad driving time), which is about 5 ns a plurality of routers, each router being associated 
(more with a large load or long connection line), so that with a corresponding one of said computers, each 
extra stages of pipelining do not add significantly to the of said routers comprising n routing automata cor- 
latency of a packet passing through a node. responding to said n dimensions, each of said n 
For simplicity and easy modularity, binary decision 50 routing automata having plural message packet 
and merge elements were used in the asynchronous inputs and plural message packet outputs, at least 
automata as built and tested by the applicant herein. some of said inputs and outputs being connected to 
With careful design, it was possible to use basically the respective communication paths of the correspond- 
same switch for both dements, simply by flipping it ing one of said n dimensions, said n routing autora- 
sideways. Each section of the switch consists of a sun- 53 ata being connected together in cascade from a 
pie t-to-2 multiplexer (or 2-to-l demultiplexer), and message packet output of one to a message packet 
enough of these are connected along a diagonal to ban- input of the next one of said routing automata cor- 
dle the width of a flit Because of the use of binary responding to a sequence of diinoisioas of the rout- 
switches, the internal coiistruction of the tested asyn- ing atfomatt; 

chronous automata IT" is as shown in FIO. 17. The 60 routing logic means oisposed withm each one of said 
decrementen 60 are a simple asynchronous ripple bor- routing automata, said routing logic means corn- 
row tvofc with a line that is pulled low to indicate com- prising means for reading the header of a message 
oletion Completion is defined by a stage that receives a packet received from one of the inputs of said one 
borrow in, and produces no borrow out because of routing automata, means for directing said message 
having a 1 on Us data input This completion signal is 65 packet to one of said outputs of said one routing 
useo^ generate the request signd fw the next stage in automata in accordance with the contents of said 
^nif^^with^rt^Hinthtr^O^th header, and means for modifying ^J^" J° 
as^jmedAat the forward>ropagation through the reflect the passage of said message packet through 
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said one routing automata, whereby each of said 
routing automata performs all message routing for 
the message packets traveling in a corresponding 
one of said dimensions. 
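
As an informal illustration of the cascade recited in claim 1, the following Python sketch traces a packet through a 2-D mesh. It is not the patented circuit. It assumes a header that holds one signed relative count per dimension (x first), and the function name `route_2d` and the coordinate convention are hypothetical. Each pass of the outer loop stands in for the routing automaton of one dimension, which forwards the packet, shortens its count by one per hop, and hands the packet to the next automaton in the cascade once the count reaches zero.

```python
from typing import List, Tuple


def route_2d(start: Tuple[int, int], dx: int, dy: int) -> List[Tuple[int, int]]:
    """Return the nodes visited after `start` for relative offsets (dx, dy).

    Assumption: the header is modeled as one signed count per dimension,
    consumed in dimension order (x, then y).
    """
    position = list(start)
    header = [dx, dy]
    path: List[Tuple[int, int]] = []
    for dim, count in enumerate(header):      # automata cascaded per dimension
        step = 1 if count > 0 else -1
        while count != 0:
            position[dim] += step             # forward one hop in this dimension
            count -= step                     # header count moves toward zero
            path.append((position[0], position[1]))
        # count is zero: the packet passes to the next automaton in the cascade
    return path


hops = route_2d((0, 0), 2, -1)
```

Routing one dimension to completion before the next is what lets each automaton handle only the traffic of its own dimension.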

2. The system of claim 1 wherein said means for directing
said message packet to one of said outputs comprises plural
decision element means each having one decision input and
plural decision outputs for directing a message packet
received at said decision input to one of said decision
outputs in accordance with the contents of said header, and
plural merge element means having plural merge inputs and a
single merge output for directing message packets received
at the plural merge inputs thereof to said merge output
without interference between said message packets.

3. The system of claim 2 wherein within a single one 
of each routing automata at least some of said decision 
inputs are connected to respective message packet in- 
puts, at least some of said merge inputs are connected to 
respective decision outputs and at least some of said 
merge outputs are connected to respective message 
packet outputs. 

4. The system of claim 1 wherein said header comprises a
sequence of symbols and said means for modifying said header
comprises:

means for decrementing a leading symbol in said header
whereby said header specifies a direction for said message
packet relative to a next one of the computers along said
selected path.

5. The system of claim 4 wherein said means for modifying
said header further comprises means for stripping no longer
needed portions of said header.

6. The system of claim 4 wherein said sequence of symbols
include symbols each comprising a count decrementable by
said means for modifying said header, said count specifying
a duration of travel in a constant dimension.

7. The system of claim 6 wherein said system permits
bi-directional travel in each one of said dimensions, said
sequence of symbols further including binary directional
symbols specifying one of two directions.

8. The system of claim 6 wherein said means for modifying
said header comprises means for decrementing a leading
non-zero one of said counts and wherein said header
specifies a change to the next one of said sequence of
dimensions upon two consecutive leading counts thereof
preceding one of said binary directional symbols being
decremented to zero.

9. The system of claim 8 wherein said means for directing
said message packet comprises means responsive to said
header specifying a change to the next one of said sequence
of dimensions for directing said message packet to a message
packet output connected in cascade to a message packet input
of a next routing automata of said next dimension.
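
The count and directional symbols of claims 6 and 7 can be illustrated with a small Python sketch of a single header flit. It follows the flit layout given in the description (a six-bit relative offset, a seventh data bit reserved for broadcast, and an eighth bit carrying the sign), but the exact bit positions, the constant names, and the three helper functions are assumptions made for illustration only. The sketch does not model the tail bit or the zero-count dimension-change rule of claim 8.

```python
OFFSET_BITS = 6
OFFSET_MASK = (1 << OFFSET_BITS) - 1   # low six data bits: relative offset
BROADCAST_BIT = 1 << 6                 # seventh data bit: reserved, unused here
SIGN_BIT = 1 << 7                      # eighth data bit: direction of travel


def encode_header_flit(offset: int) -> int:
    """Pack a signed relative offset into the 8 data bits of a header flit."""
    magnitude = abs(offset)
    if magnitude > OFFSET_MASK:
        raise ValueError("offset does not fit in six bits")
    flit = magnitude
    if offset < 0:
        flit |= SIGN_BIT
    return flit


def decode_header_flit(flit: int) -> int:
    """Recover the signed relative offset from a header flit."""
    magnitude = flit & OFFSET_MASK
    return -magnitude if flit & SIGN_BIT else magnitude


def decrement_header_flit(flit: int) -> int:
    """Reduce the count by one hop while leaving the other bits untouched."""
    magnitude = flit & OFFSET_MASK
    if magnitude == 0:
        raise ValueError("count is already zero")
    return (flit & ~OFFSET_MASK) | (magnitude - 1)


flit = encode_header_flit(-5)
after_one_hop = decrement_header_flit(flit)
```

Keeping the sign separate from the count means a router only ever decrements toward zero, whichever direction the packet is traveling.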

10. The system of claim 3 wherein:

travel by said message packets along said paths is
mono-directional and said system is one of a group of
networks comprising 2-dimensional mesh networks and
3-dimensional torus networks;

there are two message packet inputs in each routing automata
comprising one message packet input which receives message
packets traveling in the corresponding dimension and another
message packet input connectable to a message packet output
of a preceding routing automata connected in cascade
therewith; and

there are two message packet outputs in each routing
automata comprising one message packet output which
transmits message packets for travel in said corresponding
dimension and another message packet output connectable to a
message packet input of a succeeding routing automata in
cascade therewith.

11. The system of claim 10 wherein:

there are two decision element means and two merge 
element means in each of said routing automata; 

each of said decision element means has two decision 
outputs and each merge element means has two 
merge inputs. 

12. The system of claim 3 wherein:

travel by said message packets along said paths is 
bi-directional; 

there are three message packet inputs in each routing 
automata comprising first and second message
packet inputs which receive message packets trav- 
eling in first and second directions in the corre- 
sponding dimension respectively, and a third mes- 
sage packet input connectable to a message packet 
output of a preceding routing automata connected 
in cascade therewith; and 

there are three message packet outputs in each rout- 
ing automata comprising first and second message 
packet outputs which transmit message packets for 
travel in first and second directions in said corre- 
sponding dimension respectively, and a third mes- 
sage packet output connectable to the third mes- 
sage packet input of a succeeding routing automata 
in cascade therewith. 

13. The system of claim 12 wherein: 

there are at least three decision element means and 
three merge element means in each of said routing
automata; 

at least one of said decision element means has three 
decision inputs including first and second decision 
inputs corresponding to said first and second direc- 
tions and a third input corresponding to said pre- 
ceding dimension; 

at least one of said merge element means has three 
merge inputs of which at least one is connected to 
one of the three decision outputs of said one deci- 
sion element means having three decision outputs. 

14. The system of claim 1 wherein each of said com- 
puters comprise an integrated circuit including the cor- 
responding router, a processor, a memory, a packet 
interface connected to said router and a bus connecting 
said packet interface, said memory and said processor, 
wherein: 

said processor comprises means for retrieving information
data from said memory to be routed in said
network in a message packet and for computing an 
initial header based upon destination instructions 
stored in said memory; and 

said packet interface comprises means for forming a 
data stream comprising said information data and 
said initial header and transmitting said data stream 
to said router connected thereto as a message 
packet.

15. The system of claim 4 wherein:

said network is one-dimensional and comprises an
asynchronous pipeline of successive ones of said computers
in which said routers are connected in a cascaded succession
of routers each comprising one routing automata; and

said means for decrementing comprises means for providing an
asynchronous request signal to a successive one of said
routing automata.

16. The system of claim 2 wherein said plural decision 
element means are a first set of substantially identical
building block elements, said plural merge element 
means are a second set of substantially identical building 
block elements and wherein each of said routing autom- 
ata are of substantially identical structure. 

17. An inter-computer message routing system wherein message
packets are routed among a plurality of computers along
communication paths between said computers in an n-dimensional
network of said communication paths, different groups of said
communication paths comprising different ones of the n
dimensions of said network, said message packets each
comprising a header containing successive routing directions
relative to successive computers along a selected route in said
network, said system comprising:

a plurality of routers, each router being associated with a
corresponding one of said computers, each of said routers
comprising n routing automata corresponding to said n
dimensions, each of said n routing automata having plural
message packet inputs and plural message packet outputs, at
least some of said inputs and outputs being connected to
respective communication paths of the corresponding one of said
n dimensions, said n routing automata being connected together
in cascade from a message packet output of one to a message
packet input of the next one of said routing automata
corresponding to a sequence of dimensions of the routing
automata;

routing logic means disposed within each one of said routing
automata for directing a message packet received from one of
the inputs of said one routing automata to one of said outputs
of said one routing automata in accordance with the contents of
said header, whereby said routing automata performs all message
routing for the message packets traveling in a corresponding
one of said dimensions.

18. The system of claim 17 wherein said means for directing
said message packet to one of said outputs comprises plural
decision element means each having one decision input and
plural decision outputs for directing a message packet
received at said decision input to one of said decision
outputs in accordance with the contents of said header, and
plural merge element means having plural merge inputs and a
single merge output for directing message packets received
at the plural merge inputs thereof to said merge output
without interference between said message packets.

19. The system of claim 18 wherein within a single one of
each routing automata at least some of said decision inputs
are connected to respective message packet inputs, at least
some of said merge inputs are connected to respective
decision outputs and at least some of said merge outputs are
connected to respective message packet outputs.

20. The system of claim 19 wherein:

travel by said message packets along said paths is
mono-directional and said system is one of a group of
networks comprising 2-dimensional mesh networks and
3-dimensional torus networks;

there are two message packet inputs in each routing automata
comprising one message packet input which receives message
packets traveling in the corresponding dimension and another
message packet input connectable to a message packet output
of a preceding routing automata connected in cascade
therewith; and

there are two message packet outputs in each routing
automata comprising one message packet output which
transmits message packets for travel in said corresponding
dimension and another message packet output connectable to a
message packet input of a succeeding routing automata in
cascade therewith.

21. The system of claim 17 wherein: 

travel by said message packets along said paths is 
bi-directional; 

there are three message packet inputs in each routing 
automata comprising first and second message
packet inputs which receive message packets trav- 
eling in first and second directions in the corre- 
sponding dimension respectively, and a third mes- 
sage packet input connectable to a message packet 
output of a preceding routing automata connected 
in cascade therewith; and 

there are three message packet outputs in each rout- 
ing automata comprising first and second message 
packet outputs which transmit message packets for 
travel in first and second directions in said corre- 
sponding dimension respectively, and a third mes- 
sage packet output connectable to the third mes- 
sage packet input of a succeeding routing automata 
in cascade therewith. 

22. The system of claim 17 wherein each of said com- 
puters comprise an integrated circuit including the cor- 
responding router, a processor, a memory, a packet 
interface connected to said router and a bus connecting 
said packet interface, said memory and said processor, 
wherein: 

said processor comprises means for retrieving infor- 
mation data from said memory to be routed in said 
network in a message packet and for computing an 
initial header based upon destination instructions 
stored in said memory; and 

said packet interface comprises means for forming a 
data stream comprising said information data and 
said initial header and transmitting said data stream 
to said router connected thereto as a message 
packet. 

23. The system of claim 18 wherein: 

said network is one-dimensional and comprises an
asynchronous pipeline of successive ones of said
computers in which said routers are connected in a
cascaded succession of routers each comprising
one routing automata; and 
said routing logic means comprises means for provid- 
ing an asynchronous request signal to a successive 
one of said routing automata. 

24. The system of claim 18 wherein said plural decision
element means are a first set of substantially identical
building block elements, said plural merge element means are
a second set of substantially identical building block
elements and wherein each of said routing automata are of
substantially identical structure.

25. A computer node chip for use in an inter-computer message
routing system wherein message packets are routed among a
plurality of such computer node chips along communication paths
between said computer node chips in an n-dimensional network of
said communication paths, different groups of said
communication paths comprising different ones of the n
dimensions of said network, said message packets each
comprising a header containing successive routing directions
relative to successive computer node chips along a selected
route in said network, said computer node chip comprising:

a router comprising n routing automata corresponding to said n
dimensions, each of said n routing automata having plural
message packet inputs and plural message packet outputs, at
least some of said inputs and outputs being connected to
respective communication paths of the corresponding one of said
n dimensions, said n routing automata being connected together
in cascade from a message packet output of one to a message
packet input of the next one of said routing automata
corresponding to a sequence of dimensions of the routing
automata;

routing logic means disposed within each one of said 
routing automata, said routing logic means com- 
prising means for reading the header of a message 
packet received from one of the inputs of said one 
routing automata, means for directing said message 
packet to one of said outputs of said one routing 
automata in accordance with the contents of said 
header, and means for modifying said header to 
reflect the passage of said message packet through 
said one routing automata, whereby each of said 
routing automata performs all message routing for 
the message packets traveling in a corresponding 
one of said dimensions; 

a processor; 

a memory; and

a packet interface connected to said router and a bus
connecting said packet interface, said memory and said
processor, wherein said processor comprises means for
retrieving information data from said memory to be routed in
said network in a message packet and for computing an initial
header based upon destination instructions stored in said
memory, and said packet interface comprises means for forming a
data stream comprising said information data and said initial
header and transmitting said data stream to said router
connected thereto as a message packet.

26. The chip of claim 25 wherein: 

said means for directing said message packet to one of
said outputs comprises plural decision element
means each having one decision input and plural 
decision outputs for directing a message packet 
received at said decision input to one of said deci- 
sion outputs in accordance with the contents of 
said header, and plural merge element means hav- 
ing plural merge inputs and a single merge output 
for directing message packets received at the plural 
merge inputs thereof to said merge output without 
interference between said message packets; and 

within a single one of each routing automata at least 
some of said decision inputs are connected to re- 
spective message packet inputs, at least some of 
said merge inputs are connected to respective deci- 
sion outputs and at least some of said merge outputs 
are connected to respective message packet out- 
puts.
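
For readers who want to experiment with the latency expressions given in the description, the following Python sketch evaluates them numerically. It assumes the forms T_n = T_c(pD + L/W) for the synchronous case and T_n = T_f D + T_c ceil(L/W) for the asynchronous case, with T_c = 2 t_p + t_c and T_f = t_p + t_f p. The values t_p = 5 ns and t_f = 1 ns are the figures quoted in the description for 1.2 micrometer CMOS. The value t_c = 2 ns, the distance D, the packet length L, the channel width W, and all function names are hypothetical choices made only for this example.

```python
import math


def stage_times(t_p, t_c, t_f, p):
    """Return (T_c, T_f) for a node with p internal pipeline stages."""
    T_c = 2 * t_p + t_c      # node cycle time
    T_f = t_p + t_f * p      # node fallthrough time
    return T_c, T_f


def sync_latency(T_c, p, D, L, W):
    """Unblocked latency in the synchronous case: T_c * (p*D + L/W)."""
    return T_c * (p * D + L / W)


def async_latency(T_f, T_c, D, L, W):
    """Unblocked latency in the asynchronous case: T_f*D + T_c*ceil(L/W)."""
    return T_f * D + T_c * math.ceil(L / W)


# t_p = 5 and t_f = 1 (ns) come from the description; t_c = 2 is assumed.
T_c4, T_f4 = stage_times(t_p=5, t_c=2, t_f=1, p=4)
T_c8, T_f8 = stage_times(t_p=5, t_c=2, t_f=1, p=8)

# Hypothetical packet: 72 bits over a 9-bit channel, traveling 10 hops.
latency_4_stages = async_latency(T_f4, T_c4, D=10, L=72, W=9)
latency_8_stages = async_latency(T_f8, T_c8, D=10, L=72, W=9)
```

Comparing `latency_4_stages` with `latency_8_stages` shows how much of the asynchronous latency is due to pad driving rather than to the number of internal stages, for whatever parameter values are supplied.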